# Transformers with Arbitrary Inductive Bias

We construct a large language model that simulates dynamic evaluation with a small GPT model.


## Installation
Create the required conda environment with:

`conda env create -n icl_as_ft --file icl_as_ft.yml`


## Run the code
Use the following command to evaluate perplexity on wikitext-2, wikitext-103, or c4 (selected with `--dataset`).


```bash
python Perplexity_eval.py \
--model_name_or_path gpt2 \
--cache_dir $cache_dir \
--construct_save_model $save \
--construct_load_model $load \
--construct_model_path $path \
--n_simulation_layers $n_simulation_layers \
--n_forward_backward $n_forward_backward \
--inner_lr $inner_lr \
--train_fraction $tr  \
--batch_size $bs \
--dataset $dataset \
--use_eval_set $eval \
--use_test_set $test \
--use_einsum $use_einsum \
--num_attention_heads $heads \
--hidden_size $hidden_size \
--num_blanks $num_blanks
```

* `dataset`: Dataset to evaluate on: `c4`, `wikitext-2`, or `wikitext-103`
* `cache_dir`: Directory used to store and load the pretrained `gpt2` model
* `construct_save_model`: Whether to save the constructed model
* `construct_load_model`: Whether to load the constructed model from `construct_model_path`
* `construct_model_path`: Path to load or save the constructed model
* `n_simulation_layers`: Number of layers to update during dynamic evaluation
* `n_forward_backward`: Number of SGD steps
* `inner_lr`: Learning rate for dynamic evaluation; note that in our construction, gradients are summed over tokens (and not averaged)
* `train_fraction`: Fraction of each subsequence to use for training
* `batch_size`: Batch size for the forward passes
* `use_eval_set`: Whether to evaluate on the validation set
* `use_test_set`: Whether to evaluate on the test set
* `num_attention_heads`: Number of attention heads in the constructed model
* `hidden_size`: Embedding dimension of the constructed model
* `num_blanks`: Number of attention heads in constructed model
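
For reference, perplexity is the exponential of the mean negative log-likelihood per token. The snippet below is a minimal, standalone sketch of that standard definition; the function name `perplexity` and its input format are illustrative assumptions and are not taken from `Perplexity_eval.py`.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities.

    Hypothetical helper for illustration only; not part of this repository.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has perplexity close to 4.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))
```

Lower perplexity means the model assigns higher probability to the evaluated tokens.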

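To illustrate `train_fraction`, the sketch below splits a token subsequence into a part used for the SGD steps and a part used for evaluation. It assumes the training part is the prefix and that the split point is rounded down; both choices, along with the helper name `split_for_dynamic_eval`, are assumptions made for illustration and may differ from what the script does.

```python
def split_for_dynamic_eval(token_ids, train_fraction):
    """Split a subsequence into (training part, evaluation part).

    Assumption: the first `train_fraction` of the tokens is used for training
    and the remainder is evaluated. Hypothetical helper, not from this repository.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must lie in [0, 1]")
    n_train = int(len(token_ids) * train_fraction)  # rounds down
    return token_ids[:n_train], token_ids[n_train:]


tokens = list(range(10))  # stand-in for ten token ids
train_part, eval_part = split_for_dynamic_eval(tokens, 0.5)
print(train_part, eval_part)
```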

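The note on `inner_lr` matters when comparing against setups that average the loss over tokens: with summed gradients, a given learning rate produces a step that is larger by a factor equal to the number of tokens. The toy example below sketches this with a one-parameter squared-error loss; the model, the data, and the function names are all illustrative assumptions unrelated to the constructed model.

```python
def per_token_grads(w, xs, ys):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w, one value per token."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]


def sgd_step(w, grads, lr, reduction):
    """Take one SGD step using either the summed or the averaged gradient."""
    total = sum(grads)
    grad = total if reduction == "sum" else total / len(grads)
    return w - lr * grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0 = 0.0
grads = per_token_grads(w0, xs, ys)
num_tokens = len(grads)

# Summed gradients with learning rate lr ...
w_sum = sgd_step(w0, grads, lr=0.01, reduction="sum")
# ... match averaged gradients with learning rate lr * num_tokens.
w_mean = sgd_step(w0, grads, lr=0.01 * num_tokens, reduction="mean")
print(w_sum, w_mean)
```

Under this toy setup, a learning rate tuned for averaged gradients would need to be divided by the number of training tokens to give the same update with summed gradients.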
